Re: Second byte of multibyte characters causing trouble

From: David Emery
Subject: Re: Second byte of multibyte characters causing trouble
Msg-id v04210a00b7cda30acbae@[192.168.1.3]
In reply to: Second byte of multibyte characters causing trouble  ("Karen Ellrick" <k-ellrick@sctech.co.jp>)
Responses: Re: Second byte of multibyte characters causing trouble
List: pgsql-general

The usual way to deal with this is to convert the Japanese text from
Shift-JIS (which will almost always cause problems) to either EUC-JP or UTF-8
before inserting it into the DB or otherwise processing it. In both of those
encodings every byte of a multibyte character has the high bit set, so no
byte can be mistaken for an ASCII backslash or quote. You can then convert
it back to Shift-JIS before sending it to the client. For Perl there are some
scripts/modules that make encoding conversion fairly painless. Check CPAN
for the Jcode.pm module and the jcode.pl script. I think there may be some
others available now too, so it may be worth searching around a bit.
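
To make the round trip concrete, here is a minimal sketch. It uses Python's standard codecs rather than Jcode.pm, purely for illustration; the helper names and the sample string are made up for this example and are not from any library.

```python
# Illustrative only: Python's built-in codecs stand in for Jcode.pm here.
# sjis_to_euc / euc_to_sjis are hypothetical helper names.

def sjis_to_euc(raw: bytes) -> bytes:
    """Re-encode Shift-JIS bytes from the client as EUC-JP for storage."""
    return raw.decode("shift_jis").encode("euc_jp")

def euc_to_sjis(stored: bytes) -> bytes:
    """Re-encode EUC-JP bytes from the database as Shift-JIS for the client."""
    return stored.decode("euc_jp").encode("shift_jis")

# Sample text: two of its characters end in the byte 0x5C in Shift-JIS.
client_bytes = "表示ソフト".encode("shift_jis")
stored = sjis_to_euc(client_bytes)

print(b"\\" in client_bytes)               # a 0x5C byte is present in Shift-JIS
print(all(b >= 0x80 for b in stored))      # every EUC-JP byte has the high bit set
print(euc_to_sjis(stored) == client_bytes) # the round trip restores the input
```

With the text stored as EUC-JP, nothing between Perl, DBI and the backend can mistake part of a character for an escape character, so the "\\" workaround should no longer be needed for these characters.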

Gambatte,
-dave

At 15:24 +0900 01.9.18, Karen Ellrick wrote:
>I am using Perl CGI scripts with DBI to take data from a web interface and
>from text files to put into my database, and I'm dealing with Japanese (i.e.
>two-byte characters).  PostgreSQL is installed with multibyte enabled, but
>somewhere in the communication chain from Perl to DBI to PostgreSQL,
>something is trying to interpret multibyte text byte by byte, which is
>causing trouble.  The example that has been discovered so far is that if the
>second of the two bytes is 0x5c (in ASCII, "\"), it gets swallowed and a
>ripple effect of byte pairs ensues (at least if the byte after the 0x5c
>isn't a valid character to follow \ to make a metacharacter - if it is, who
>knows what will happen!).  I fixed that one by replacing any \ in the
>strings with "\\" to get a literal 0x5C byte past whatever is trying to
>interpret it.  But I am wondering what other similar pitfalls I have to
>watch out for, and I'm hoping others have ideas.  For example, is my SQL
>insert or update statement going to choke if the second byte of one of the
>characters is the same as ASCII for a single quote?  The possibilities are
>endless, depending on what part of the process is doing the damage.  And
>trying to test this stuff is like looking for a needle in a haystack - it's
>not easy to figure out what Japanese characters have second bytes that would
>have special meaning if interpreted as ASCII.
>
>If someone knows how to set things up so that all text is guaranteed to go
>through unscathed (make Perl or DBI multi-byte aware, or whatever - i.e. the
>real fix), that would be ideal.  Otherwise, at least some ideas would be
>welcome regarding what other bytes to write bandaid code for.  I know I'm
>not the only one trying to use Perl to maintain PostgreSQL databases with
>Japanese or Chinese text! :-)
>
>Thanks in advance,
>Karen
>
>--------------------------------
>Karen Ellrick
>S & C Technology, Inc.
>1-21-35 Kusatsu-shinmachi
>Hiroshima  733-0834  Japan
>(from U.S. 011-81, from Japan 0) 82-293-2838
>--------------------------------
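
On the question above about a second byte that matches the ASCII single quote: in Shift-JIS the second byte of a two-byte character falls in 0x40-0x7E or 0x80-0xFC, so 0x27 (') cannot occur there, while 0x5C (\) and 0x7C (|) can. The sketch below checks this empirically instead of hunting for characters by hand. It is in Python for illustration, the function name is hypothetical, and it assumes Python's "shift_jis" codec is a fair stand-in for the encoding the client actually sends.

```python
# Illustrative scan: which printable ASCII bytes can appear as the second
# byte of a two-byte Shift-JIS character? ascii_trail_bytes is a
# hypothetical helper name.

def ascii_trail_bytes(encoding: str = "shift_jis") -> set:
    # Lead-byte ranges for two-byte Shift-JIS characters.
    lead_bytes = list(range(0x81, 0xA0)) + list(range(0xE0, 0xF0))
    found = set()
    for lead in lead_bytes:
        for trail in range(0x20, 0x7F):  # printable ASCII candidates
            try:
                text = bytes([lead, trail]).decode(encoding)
            except UnicodeDecodeError:
                continue  # not a valid two-byte sequence
            if len(text) == 1:
                found.add(chr(trail))
    return found

# Letters and digits are harmless to SQL; keep only the punctuation.
hazards = {c for c in ascii_trail_bytes() if not c.isalnum()}
print(sorted(hazards))
print("\\" in hazards)  # backslash can be a second byte
print("'" in hazards)   # single quote cannot
```

The printed list gives the bytes that bandaid code would have to cover if the text stays in Shift-JIS.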

